Techniques and tools for measuring energy efficiency of scientific software applications
Abstract
The scale of scientific High Performance Computing (HPC) and High Throughput Computing (HTC) has increased significantly in recent years, and is becoming sensitive to total energy use and cost. Energy efficiency has thus become an important concern in scientific fields such as High Energy Physics (HEP). There has been a growing interest in utilizing alternative architectures, such as low-power ARM processors, to replace traditional Intel x86 architectures. Nevertheless, even though such solutions have been successfully used in mobile applications with low I/O and memory demands, it is unclear whether they are suitable and more energy-efficient in the scientific computing environment. Furthermore, there is a lack of tools and experience to derive and compare power consumption between the architectures for various workloads, and eventually to support software optimizations for energy efficiency. To that end, we have performed several physical and software-based measurements of workloads from HEP applications running on ARM and Intel architectures, and compared their power consumption and performance. We leverage several profiling tools (both in hardware and software) to extract different characteristics of the power use. We report the results of these measurements and the experience gained in developing a set of measurement techniques and profiling tools to accurately assess the power consumption for scientific workloads.
1. Introduction
The Large Hadron Collider (LHC) [1] at the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland, is an example of a scientific project whose computing resource requirements are larger than those likely to be provided in a single computer center. Data processing and storage are distributed across the Worldwide LHC Computing Grid (WLCG) [2], which uses resources from 160 computer centers in 35 countries.
Such computational resources have enabled the CMS [3] and ATLAS [4] experiments to discover the Higgs Boson [5, 6], for example. The WLCG requires a massive amount of computational resources (250,000 x86 cores in 2012) and, proportionally, energy. In the future, with planned increases to the LHC luminosity [7], the dataset size will increase by 2-3 orders of magnitude, presenting even more challenges in terms of energy consumption.
In order to find and develop better solutions for improving energy efficiency in High Energy Physics (HEP) computing, it is important to understand how energy is used by the HEP systems themselves. We describe several tools and techniques that help researchers reach that goal.
As energy efficiency becomes a concern, new solutions have been considered to develop energy-efficient systems. One potential solution is to replace the traditional Intel x86 architectures with low-power architectures such as ARM. A comparison of the energy efficiency of the ARMv7 and Intel x86 architectures is conducted in this article. The experiments use CMS workloads and rely on the techniques and tools described earlier to perform the measurements.
This article is structured as follows. Firstly, we describe where energy is consumed in an HTC system and outline some of the tools and techniques available to measure and monitor energy consumption on HTC systems (Section 2). Secondly, we present the results of a comparison between the ARMv7 and Intel Xeon architectures using CMS workloads (Section 3). Finally, we present IgProf, a general purpose, open source application performance profiler, and describe its recently added energy profiling features and 64-bit ARM support (Section 4).
2. Tools and techniques for energy measurement
When optimizing power usage, there are two granularities at which one can look at a computing system.
The coarser granularity takes into account the behavior of the whole node (or some of its passive parts, e.g. the transformer) as part of a rack in a datacenter. This is usually investigated when engineering and optimizing computing centers. Alternatively, a more detailed approach is to look into the components which make up the active parts of a node, in particular the CPU and its memory subsystem, since these are responsible for a sizeable fraction of the consumed power. They are also the place where the largest efficiency gains can be obtained through optimizations in the software.
If one is simply interested in the coarse power consumption per node, external probing devices can be used: monitoring interfaces of the rack power distribution units, plug-in meters and non-invasive clamp meters (which measure the current pulled by the system by induction, without making physical contact with it). They differ mostly in terms of flexibility. Their accuracy is typically a few percent for power, whereas their time resolution is on the order of seconds. This is more than enough to optimize the electrical layout of datacenters or to provide a baseline for more detailed studies.
An alternative approach takes into account the internal structure of a computing element of an HTC system, as shown in figure 1. Nowadays, every board manufacturer provides on-board chips which monitor the energy consumption of different components of the system. These allow fine-grained energy measurements, as it is possible to individually monitor the energy consumption of components such as the CPU, its memory subsystem, and others. An example of such a monitoring chip is the Texas Instruments INA231 [8] current-shunt and power monitor, which is found on the ARMv7 developer board we used for our studies. It is quite common in the industry.
Compared to external methods, these on-board components provide high-accuracy measurements with reasonably fine time resolution (millisecond level).
A special and slightly different case of these on-board monitors is a technology called Running Average Power Limit (RAPL), provided by Intel beginning with the Sandy Bridge family of processors. Contrary to other solutions, which are implemented as discrete chips, RAPL is embedded in the CPU package itself and provides information on the CPU's own subsystems. In particular, RAPL provides data for three different domains: package (pkg), which measures energy consumed by the system's sockets; power plane 0 (pp0), which measures energy consumed by the CPU core(s); and dram, which accounts for the sum of energy consumed by memory in a given socket, therefore excluding the on-core caches [9]. As in the case of the discrete components, the timing resolution of the measurements is in the millisecond range [10]. This is fine enough to allow such data to be used to build an energy consumption sampling profiler for applications, similar to how performance sampling profilers work (see section 4).
Figure 1. Components that contribute to power consumption in HPC
Finally, in addition to power monitoring of the sockets, RAPL can limit the power consumed by the different domains. This feature, usually referred to as power capping, allows the user to define the average power consumption limit of a domain in a defined time window, and allows more accurate independent measurements of the non-limited components.
3. Power efficiency measurements with x86-64 and ARMv7
In this section, we demonstrate the potential of some of the tools we previously described. To that end, we perform several measurements of workloads from CERN, running on different architectures.
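As a concrete illustration of the RAPL energy counters described in section 2, the following Python sketch reads a cumulative energy counter before and after a task and derives the energy used and the average power. It is a minimal sketch under stated assumptions, not the measurement code used for this study: it assumes the Linux powercap sysfs layout (`/sys/class/powercap/intel-rapl:0`, with the files `energy_uj` and `max_energy_range_uj`), assumes the counter wraps at most once during the task, and the helper names (`read_energy_uj`, `energy_delta_joules`, `measure`) are hypothetical.

```python
import time
from pathlib import Path

# Assumption: Linux powercap sysfs layout for RAPL package domain 0.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_energy_uj(domain=RAPL_DOMAIN):
    """Read the cumulative energy counter of a domain, in microjoules."""
    return int((domain / "energy_uj").read_text())


def read_max_range_uj(domain=RAPL_DOMAIN):
    """Read the range of the energy counter, in microjoules."""
    return int((domain / "max_energy_range_uj").read_text())


def energy_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy used between two readings, tolerating one counter wraparound."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:  # the counter wrapped between the two readings
        delta_uj += max_range_uj
    return delta_uj / 1e6


def measure(task, read_uj=read_energy_uj, max_range_uj=None,
            clock=time.monotonic):
    """Run task() and return (energy in joules, average power in watts)."""
    if max_range_uj is None:
        max_range_uj = read_max_range_uj()
    t0, e0 = clock(), read_uj()
    task()
    t1, e1 = clock(), read_uj()
    joules = energy_delta_joules(e0, e1, max_range_uj)
    return joules, joules / (t1 - t0)
```

A call such as `measure(run_benchmark)`, where `run_benchmark` is a placeholder for the workload, would return the energy and average power of the chosen domain over the run. On many systems reading `energy_uj` requires elevated privileges, and the task should last much longer than the millisecond-level resolution of the counter for the average to be meaningful.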
The workloads used in the experiment run on the Intel x86-64 architecture, traditionally used in HTC and data centers, and on the 32-bit ARMv7 architecture (for similar studies on 64-bit ARMv8 and Xeon Phi, please refer to [11]). The ARM architecture, initially developed for mobile devices, has been considered [12, 13] as a potential alternative to Intel in HTC, given its energy efficiency. We also present a brief comparison between the ARM and Intel architectures from the energy consumption perspective, based on the results obtained.
3.1. Tools and techniques
For the Intel architecture, we used the RAPL technology to measure the energy consumed by the package, DRAM and cores (figure 1). The external measurements for the baseline were performed using a rack PDU, which provides an online API to gather the energy consumed by the system on the rack at a sampling interval of 1 second.
For the ARM board, we used the Texas Instruments INA231 power monitor chip, which allows reading of the energy consumed by the cores and DRAM at a sampling interval on the order of microseconds. The chip was embedded in the board by the vendor. For the external measurements, we used an external plug-in power monitor with a computer interface for gathering and storing the results.
In both cases we read the data as it was exposed to the system via the sysfs / devfs knobs. The machine specifications can be seen in figure 2.
Figure 2. Machine specifications
3.2. Experiment setup
The workload used for the experiments was ParFullCMS, a multi-threaded Geant4 [14] benchmark application which uses a complex CMS geometry for its simulation. Using ParFullCMS, we ran simulation tasks on both the Intel and ARM machines (figure 3). The workflow was run several times with different numbers of threads on each machine. The number of threads run in each experiment was chosen according to the number of cores of the machine.
3.3. Analysis
As expected, the ARMv7 architecture shows more encouraging results than Intel from the energy efficiency perspective in all the experiments performed. Also as expected, neither architecture performs better when overcommitted (more threads than the physical number of cores). Note that the ARM results when overcommitted (8 threads) are much worse than the Intel ones. This is due to the relatively modest amount of available DRAM (figure 2), which caused the machine to start swapping, greatly affecting performance. While this was expected, since the ARM system used is just a development board for mobile applications, it is a clear indication that a final assessment of the power efficiency of an architecture requires a full server-grade system in order to make a proper comparison.
4. Profiling for power efficiency
The hardware components described above provide measurements that relate to the full set of processes running on the machine. For the simple case where only a single benchmark application is running, these can be used to make comparisons between architectures. A further step is to map the energy consumption measurements to functions and methods within an executing process. Such a mapping would allow for optimizations of the software itself.
This kind of mapping can be done in two different ways, which we call instrumentation and sampling profiling. In the instrumentation case, readings of the profiled quantities (e.g. energy consumption), or of quantities correlated with them (e.g. CPU power state transitions), are taken at the beginning and the end of a profiled task, and the difference between the two is used to estimate the average power consumption over that period of time. By keeping track of start and stop values for monitored tasks, one can get a fairly complete picture of what is happening in the system, provided the measured interval is large compared to the time resolution of the measurement.
This condition is needed both to avoid a large error in the estimated average and to limit the performance overhead of the measurement itself.
Figure 3. Chip monitor and external measurement results. The results are shown according to the relation of events per number of cores of each machine and their absolute number of cores.
Sampling profiling, on the other hand, takes a different approach: a given quantity is sampled regularly, and at each sample the measured quantity is accumulated until it overflows a user-provided limit. When this happens, the profiler increments a counter for the process / function being executed at that precise moment. Assuming that the distribution of where time is spent in a system is constant over time (which is typically true for large data processing tasks), such a sampling algorithm converges to the actual distribution of the measured quantity. The advantage of this approach is that the fidelity of the measurement depends, to first approximation, only on the number of samples taken, regardless of the error on the profiled quantity. This also allows the performance overhead to be minimized by tuning the sampling period to be much larger than the duration of the measurement itself.
IgProf is a general purpose, open source application performance profiler. It was developed in HEP, but it is capable of profiling all types of software applications. The profiler has been available on the x86 and x86-64 platforms for many years [15, 16], and recently we have also ported it to ARMv7 and ARMv8. Moreover, we have now added a statistical sampling energy profiling module which provides function-level energy cost distributions [17]. This module uses the PAPI library to read energy measurements from the RAPL interface previously described. To illustrate the new module, we use it to profile the memory benchmark STREAM [18]. Figure 4 compares the results from performance and energy profiling of the benchmarking tool.
The X-axis shows the four main functions contributing to the execution time and energy consumption of the STREAM tool: Add, Copy, Triad and Scale. The left scale of the Y-axis and the perf ticks series describe the execution time spent in each function, whereas the right scale of the Y-axis and the nrg pkg, nrg pp0 and nrg pp1 series describe the amount of energy spent in each function. The energy consumption of the processor package domain and of power plane 0 (describing the CPU cores) seems to follow the time spent in the functions, whereas the energy consumption of power plane 1 (describing the unused GPU) stays fairly constant at zero. As we would expect from a simple benchmark, the profiling results of a simple single-threaded application show a correlation between the execution time and the energy spent in a function. While the energy profiling module is now fully functional within IgProf, further work needs to be done to tune the measurements and to gain experience with how to use the profiles obtained
Figure 4. [Plot: execution time (s) and energy (J) for the STREAM functions Add, Copy, Triad and Scale.]
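The sampling approach described in section 4 can be sketched in a few lines of Python. This is an illustrative sketch only and not IgProf's implementation: the function name `sampling_profile`, the input format (a list of `(function_name, amount)` pairs, where `amount` is the quantity consumed since the previous sample) and the trace values below are all assumptions made up for the example.

```python
from collections import Counter


def sampling_profile(samples, threshold):
    """Attribute a sampled quantity to functions by threshold overflow.

    samples: iterable of (function_name, amount) pairs, where amount is the
             quantity (e.g. energy) consumed since the previous sample and
             function_name is the function executing at sampling time.
    threshold: user-provided limit; each overflow produces one tick.
    """
    ticks = Counter()
    accumulated = 0
    for function_name, amount in samples:
        accumulated += amount
        # Charge one tick to the currently executing function for every
        # overflow of the limit, carrying the remainder forward.
        while accumulated >= threshold:
            ticks[function_name] += 1
            accumulated -= threshold
    return ticks


# Hypothetical trace: energy per sample in arbitrary integer units.
trace = [("Add", 6), ("Copy", 6), ("Triad", 3), ("Add", 5), ("Scale", 20)]
profile = sampling_profile(trace, threshold=10)
```

The `while` loop is a design choice of this sketch: it lets a single large sample produce several ticks, so that the total number of ticks stays proportional to the total accumulated quantity. With enough samples, the tick counts per function approximate the distribution of the measured quantity across functions.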
Similar resources
EVALUATING EFFICIENCY OF BIG-BANG BIG-CRUNCH ALGORITHM IN BENCHMARK ENGINEERING OPTIMIZATION PROBLEMS
Engineering optimization needs easy-to-use and efficient optimization tools that can be employed for practical purposes. In this context, stochastic search techniques have good reputation and wide acceptability as being powerful tools for solving complex engineering optimization problems. However, increased complexity of some metaheuristic algorithms sometimes makes it difficult for engineers t...
An Interleaved Configuration of Modified KY Converter with High Conversion Ratio for Renewable Energy Applications; Design, Analysis and Implementation
In this paper, a new high efficiency, high step-up, non-isolated, interleaved DC-DC converter for renewable energy applications is presented. In the suggested topology, two modified step-up KY converters are interleaved to obtain a high conversion ratio without the use of coupled inductors. In comparison with the conventional interleaved DC-DC converters such as boost, buck-boost, SEPIC, ZETA a...
A Method for Measuring Energy Consumption in IaaS Cloud
The ability to measure the energy consumed by cloud infrastructure is a crucial step towards the development of energy efficiency policies in the cloud infrastructure. There are hardware-based and software-based methods of measuring energy usage in cloud infrastructure. However, most hardware-based energy measurement methods measure the energy consumed system-wide - including the energy lost in...
Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques
Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been an activity in the computer industry since the 1940s and has been reviewed several times. A SDEE model is appropriate if it provides the accuracy and confidence simultaneously before softwa...
Translation Invariant Approach for Measuring Similarity of Signals
In many signal processing applications, an appropriate measure to compare two signals plays a fundamental role in both implementing the algorithm and evaluating its performance. Several techniques have been introduced in literature as similarity measures. However, the existing measures are often either impractical for some applications or they have unsatisfactory results in some other applicati...
Journal title:
- CoRR
Volume abs/1410.3440
Pages -
Publication date 2014